Abstract


The conventional classification schemes (notably multinomial logistic regression) used in
conjunction with convolutional networks (convnets) are classical in statistics, designed without
consideration for the usual coupling with convnets, stochastic gradient descent, and
backpropagation. In the specific application to supervised learning for convnets, a simple
scale-invariant classification stage turns out to be more robust than multinomial logistic
regression, appears to result in slightly lower errors on several standard test sets, has similar
computational costs, and features precise control over the actual rate of learning.
“Scale-invariant” means that multiplying the input values by any nonzero real number leaves the
output unchanged.


1 Introduction 


Classification of a vector of real numbers (called “feature activations”) into one of several discrete
categories is well established and well studied, with generally satisfactory solutions such as the
ubiquitous multinomial logistic regression reviewed, for example, by Hastie et al. (2009). However,
the canonical classification may not couple well with generation of the feature activations via
convolutional networks (convnets) trained using stochastic gradient descent, as discussed, for
example, by LeCun et al. (1998). Fitting (also known as learning or training) the combination of the
convnet and the classification stage by minimizing the cost/loss/objective function associated with
the classification suggests designing a stage specifically for use in such joint
fitting/learning/training. In particular, the convnets presented in the present paper are
“equivariant” with respect to scalar multiplication: multiplying the input values by any real number
multiplies the output by the same factor. The present paper leverages this equivariance via a
“scale-invariant” classification stage, that is, a stage for which multiplying the input values by
any nonzero real number leaves the output unchanged. The scale-invariant classification stage turns
out to be more robust to outliers (including obviously mislabeled data), fits/learns/trains
precisely at the rate that the user specifies, and apparently results in slightly lower errors on
several standard test sets when used in conjunction with some typical convnets for generating the
feature activations. The computational costs are comparable to those of multinomial logistic
regression. Similar classification has been introduced earlier in other contexts by Hill & Doucet
(2007), Lange & Wu (2008), Wu & Lange (2010), Saberian & Vasconcelos (2011), Mroueh et al. (2012),
Wu & Wu (2012), and others. Complementary normalization includes the work of Carandini & Heeger
(2012), Ioffe & Szegedy (2015), and the associated references. The key to effective learning is
rescaling, as described in Section 3 below (see especially the last paragraph there). This rescaled
learning, while necessary for training convolutional networks, is unnecessary in the aforementioned
earlier works.


The remainder of the present paper has the following structure: Section 2 sets the notation.
Section 3 introduces the scale-invariant classification stage. Section 4 analyzes its robustness.
Section 5 illustrates the performance of the classification on several standard data sets.
Section 6 concludes the paper.
2 Notational conventions


All numbers used in the classification stage will be real valued (though the numbers used for
generating the inputs to the stage may in general be complex valued). We follow the recommendations
of Magnus & Neudecker (2007): all vectors are column vectors (aside from gradients of a scalar
with respect to a column vector, which are row vectors), and we use ||v|| to denote the Euclidean
norm of a vector v; that is, ||v|| is the square root of the sum of the squares of the entries of v.
We use ||A|| to denote the spectral norm of a matrix A; that is, ||A|| is the greatest singular
value of A, which is also the maximum of ||Av|| over every vector v such that ||v|| = 1. The
terminology “Frobenius norm” of A refers to the square root of the sum of the squares of the entries
of A. The spectral norm of a vector viewed as a matrix having only one column or one row is the same
as the Euclidean norm of the vector; the Euclidean norm of a matrix viewed as a vector is the same
as the Frobenius norm of the matrix.
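
As a quick numerical illustration of these conventions, the following NumPy sketch computes the
three norms and checks the two identities stated above. It is an addition for illustration only;
the helper names are hypothetical and the random test data are arbitrary.

```python
import numpy as np


def euclidean_norm(v):
    """Square root of the sum of the squares of the entries of a vector."""
    return float(np.sqrt(np.sum(np.square(v))))


def spectral_norm(a):
    """Greatest singular value of a matrix."""
    return float(np.linalg.svd(a, compute_uv=False)[0])


def frobenius_norm(a):
    """Square root of the sum of the squares of the entries of a matrix."""
    return float(np.sqrt(np.sum(np.square(a))))


rng = np.random.default_rng(0)
v = rng.standard_normal(5)
a = rng.standard_normal((4, 5))

# A vector viewed as a one-column matrix: spectral norm equals Euclidean norm.
assert np.isclose(spectral_norm(v.reshape(-1, 1)), euclidean_norm(v))
# A matrix viewed as a vector: Euclidean norm equals Frobenius norm.
assert np.isclose(euclidean_norm(a.ravel()), frobenius_norm(a))
# The spectral norm never exceeds the Frobenius norm.
assert spectral_norm(a) <= frobenius_norm(a) + 1e-12
```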


3 A scale-invariant classification stage


We study a linear classification stage that assigns one of k classes to each real-valued vector x of
feature activations (together with a measure of confidence in its classification), with the assignment
being independent of the Euclidean norm of x; the Euclidean norm of x is its “scale.” We associate
to the k classes target vectors t_1, t_2, ..., t_k that are the vertices of either a standard simplex
or a regular simplex embedded in a Euclidean space of dimension m ≥ k. The dimension of the
embedding space being strictly greater than the minimum (k − 1) required to contain the simplex
will give extra space to help facilitate learning; Hill & Doucet (2007), Lange & Wu (2008), Wu &
Lange (2010), Saberian & Vasconcelos (2011), Mroueh et al. (2012), and Wu & Wu (2012) (amongst
others) discuss these simplices and their applications to classification. For the standard simplex, the
targets are just the standard basis vectors, each of which consists of zeros for all but one entry. For
both the regular and standard simplices,

$$ \|t_1\| = \|t_2\| = \cdots = \|t_k\| = 1. \tag{1} $$


Given an input vector x of feature activations, we identify the target vector t_j that is nearest in the
Euclidean distance to

$$ z = \frac{y}{\|y\|}, \tag{2} $$

where

$$ y = Ax \tag{3} $$

for an m × n matrix A determined via learning as discussed shortly. The index j such that ||z − t_j||
is minimal is the index of the class to which we assign x. The classification is known as “linear” or
“multi-linear” due to (3). The index j to which we assign x is clearly independent of the Euclidean
norm of x due to (2), and the assignment is “scale-invariant” even if we rescale A by a nonzero
scalar multiple.
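
The assignment rule can be sketched in a few lines of NumPy. The text does not specify how the
regular simplex is constructed, so the construction below (centering and normalizing the standard
basis of R^k, then padding with zeros up to dimension m) is an assumption made for illustration, as
are the function names and the random A and x. The example rescales x and A by positive factors and
keeps the results for comparison.

```python
import numpy as np


def regular_simplex_targets(k, m):
    """Return a k x m array whose rows are unit-norm vertices of a regular simplex.

    Assumed construction: center the standard basis of R^k, normalize each
    vertex, and pad with zeros up to dimension m >= k.
    """
    if m < k:
        raise ValueError("this sketch requires m >= k")
    centered = np.eye(k) - 1.0 / k
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    return np.hstack([centered, np.zeros((k, m - k))])


def classify(a, x, targets):
    """Return (index of the nearest target, z), with y = A x and z = y / ||y||."""
    y = a @ x
    z = y / np.linalg.norm(y)
    distances = np.linalg.norm(targets - z, axis=1)
    return int(np.argmin(distances)), z


rng = np.random.default_rng(0)
k, m, n = 10, 16, 32
targets = regular_simplex_targets(k, m)
a = rng.standard_normal((m, n))
x = rng.standard_normal(n)

j, z = classify(a, x, targets)
# Rescale x, then A, by positive factors; z and the assigned class should not change.
j_scaled_x, z_scaled_x = classify(a, 1000.0 * x, targets)
j_scaled_a, z_scaled_a = classify(0.001 * a, x, targets)
```

Using the standard simplex instead would amount to replacing `targets` with the first k rows of the
m × m identity matrix.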

To determine A, we first initialize all its entries to random numbers, then divide each entry by the
Frobenius norm of A and multiply by the square root of the number of rows in A. We then conduct
iterations of stochastic gradient descent as advocated by LeCun et al. (1998), updating A to Ã on
each iteration via

$$ \tilde{A} = A - h \, \frac{\partial c}{\partial A}, \tag{4} $$

where h is a positive real number (known as the “learning rate” or “step length”) and c is the cost to
be minimized that is associated with a vector chosen at random from among the input vectors and
its associated vector x of feature activations,

$$ c = \|z - t_j\|^2, \tag{5} $$

where t_j is the target for the correct class associated with x, and z is the vector-valued function
of x specified in (2) and (3).
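
A minimal sketch of the cost (5) and one step of (4) follows, assuming the squared Euclidean
distance between z and the target as the cost. The closed-form gradient in the code comes from
applying the chain rule to (2), (3), and (5); the function names and the random data are
hypothetical, and the rescaling of A that follows each update is not included here.

```python
import numpy as np


def cost_and_gradients(a, x, t):
    """Return c = ||z - t||^2 together with dc/dA and dc/dx.

    Here y = A x and z = y / ||y||, so the chain rule gives
    dc/dy = 2 ((t . z) z - t) / ||y||.
    """
    y = a @ x
    norm_y = np.linalg.norm(y)
    z = y / norm_y
    c = float(np.sum((z - t) ** 2))
    dc_dy = 2.0 * (np.dot(t, z) * z - t) / norm_y
    dc_da = np.outer(dc_dy, x)  # same shape as A
    dc_dx = a.T @ dc_dy         # gradient passed back into the network
    return c, dc_da, dc_dx


def sgd_step(a, x, t, h):
    """One update as in (4): A - h * dc/dA (before any rescaling)."""
    _, dc_da, _ = cost_and_gradients(a, x, t)
    return a - h * dc_da


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6))
x = rng.standard_normal(6)
t = np.array([0.0, 0.0, 1.0, 0.0])  # standard-simplex target for the correct class
a_updated = sgd_step(a, x, t, h=0.1)
```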

As elaborated by LeCun et al. (1998), usually we combine stochastic gradient descent with
backpropagation to update the entries of x associated with the chosen input, which requires
propagating the gradient ∂c/∂x back into the network generating the feature activations that are the
entries of x for the chosen input sample. We use the same learning rate from the classification stage
throughout the network generating the feature activations. Fortunately, a straightforward calculation
shows that the Euclidean norm of the gradient ∂c/∂x is bounded independent of the scaling of A:

$$ \left\| \frac{\partial c}{\partial x} \right\| \le \frac{4 \|A\|}{\|Ax\|}; \tag{6} $$

please note that scaling the matrix A by any nonzero scalar multiple has no effect on the right-hand
side of (6); that is, the bound on the gradient propagating in backpropagation is independent of the
size of A.
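
For completeness, one way to carry out the calculation behind (6) is sketched below. This
derivation is a reconstruction added for the reader, not text from the original; it assumes the
cost is the squared distance between z and t_j, and uses that I − zz^T is an orthogonal projection
(so its spectral norm is at most 1) together with ||z|| = ||t_j|| = 1.

```latex
\begin{align*}
\frac{\partial c}{\partial x}
  &= \frac{\partial c}{\partial z} \, \frac{\partial z}{\partial y} \, \frac{\partial y}{\partial x}
   = 2 \, (z - t_j)^\top \, \frac{I - z z^\top}{\|y\|} \, A, \\
\left\| \frac{\partial c}{\partial x} \right\|
  &\le \frac{2 \, \|z - t_j\| \, \|I - z z^\top\| \, \|A\|}{\|Ax\|}
   \le \frac{2 \, (\|z\| + \|t_j\|) \, \|A\|}{\|Ax\|}
   = \frac{4 \, \|A\|}{\|Ax\|}.
\end{align*}
```

Replacing A by sA for a nonzero scalar s multiplies both ||A|| and ||Ax|| by |s|, which is why the
bound does not depend on the scaling of A.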

Critically, after every update as in (4), we rescale the matrix A: we divide every entry by the
Frobenius norm of A and multiply by the square root of the number of rows in A. We use the rescaled
matrix for subsequent iterations of stochastic gradient descent. Rescaling A yields precisely the
same vector z in (2) and cost c in (5); together with the scale-invariance of the right-hand side of
(6), rescaling ensures that the stochastic gradient iterations are effective and numerically stable
for any learning rate h.
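
The rescaling step is simple enough to state directly in code. The sketch below (with hypothetical
function names and an arbitrary random A) rescales a matrix so that its Frobenius norm equals the
square root of its number of rows, and provides a helper for computing the normalized output before
and after rescaling.

```python
import numpy as np


def rescale(a):
    """Divide A by its Frobenius norm and multiply by the square root of its number of rows."""
    return a * (np.sqrt(a.shape[0]) / np.linalg.norm(a))


def normalized_output(a, x):
    """z = A x / ||A x||, as in (2) and (3)."""
    y = a @ x
    return y / np.linalg.norm(y)


def rescaled_sgd_step(a, grad_a, h):
    """Take the step (4) with a precomputed gradient dc/dA, then rescale the result."""
    return rescale(a - h * grad_a)


rng = np.random.default_rng(0)
a = 50.0 * rng.standard_normal((16, 32))  # deliberately badly scaled
x = rng.standard_normal(32)
a_rescaled = rescale(a)

z_before = normalized_output(a, x)
z_after = normalized_output(a_rescaled, x)
```

Because z is unchanged by the rescaling, so is the cost; only the size of A relative to the step h
times the gradient changes.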


4 Robustness 


Combining (1) and the fact that the Euclidean norm of z from (2) is 1 yields that the cost c from (5)
satisfies

$$ 0 \le c \le 4. \tag{7} $$


As reviewed by Hastie et al. (2009), the cost associated with classification via multinomial logistic
regression is

$$ r = -\ln\left( \frac{\exp(y^{(j)})}{\sum_{i=1}^{k} \exp(y^{(i)})} \right), \tag{8} $$

where j is the index among 1, 2, ..., k of the correct class, and y^(1), y^(2), ..., y^(k) are the
entries of the vector y from (3), with m = k for multinomial logistic regression. Whereas the cost c
is bounded as in (7), the cost r from (8) is bounded only for positive values of y^(j), growing
linearly for negative y^(j). Thus, c is more robust than r to outliers (including obviously
mislabeled inputs).
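
The contrast between the two costs can be seen numerically. The sketch below assumes the standard
softmax cross-entropy form for the multinomial logistic cost and uses standard-simplex targets; the
function names and the synthetic "outlier" activations are hypothetical. It pushes the activation
for the correct class increasingly negative and prints both costs.

```python
import numpy as np


def scale_invariant_cost(y, t):
    """c = ||y / ||y|| - t||^2 for a unit-norm target t; always within [0, 4]."""
    z = y / np.linalg.norm(y)
    return float(np.sum((z - t) ** 2))


def logistic_cost(y, j):
    """r = ln(sum_i exp(y_i)) - y_j, computed with the usual max-shift for stability."""
    shift = np.max(y)
    return float(shift + np.log(np.sum(np.exp(y - shift))) - y[j])


k, j = 10, 3
t = np.zeros(k)
t[j] = 1.0  # standard-simplex target for class j

# Push the activation of the correct class increasingly negative (an "outlier").
for magnitude in (1.0, 10.0, 100.0, 1000.0):
    y = np.zeros(k)
    y[j] = -magnitude
    print(magnitude, scale_invariant_cost(y, t), logistic_cost(y, j))
```

In this setup the scale-invariant cost stays at its upper bound while the logistic cost keeps
growing with the magnitude of the outlier.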


5 Numerical experiments 


The present section provides a brief empirical evaluation of rescaling in comparison with the usual
multinomial logistic regression, performing the learning for both via stochastic gradient descent (the
learning is end-to-end, training the entire network, including both the convolutional network and
the classification stage, jointly, with the same learning rate everywhere). The experiments (and
corresponding figures) consider various choices for the learning rate h and for the dimension m of
the space containing the simplex targets. We renormalize the parameters in the classification stage
after every minibatch of 100 samples when rescaling (not with the multinomial logistic regression),
as detailed in the last paragraph of Section 3 and the penultimate paragraph of the present section.
The rescaled approach appears to perform somewhat better than multinomial logistic regression in 
all but Figure for the experiments detailed in the present section. The remainder of the present 
section provides details. 


Following LeCun et al. (1998), the architectures for generating the feature activations are
convolutional networks (convnets) consisting of series of stages, with each stage feeding its output
into the next (except for the last, which feeds into the classification stage). Each stage convolves
each image from its input against several learned convolutional kernels, summing together the
convolved images from all the inputs into several output images, then takes the absolute value of
each pixel of each resulting image, and finally averages over each patch in a partition of each image
into a grid of 2 × 2 patches. All convolutions are complex valued and produce pixels only where the
original images cover all necessary inputs (that is, a convolution reduces each dimension of the
image by one less than the size of the convolutional kernel). We subtract the mean of the pixel
values from each input image before processing with the convnets, and we append an additional
feature activation to those obtained from the convnets, namely the standard deviation of the set of
values of the pixels in the image. For each data set, we use two network architectures, where the
second is a somewhat smaller variant of the first. We consider three data sets whose training
properties are reasonably straightforward to investigate, with each set consisting of k = 10 classes
of images; the first two are the usual CIFAR-10 and MNIST of Krizhevsky (2009) and LeCun et al.
(1998). The third is a subset of the 2012 ImageNet data set of Russakovsky et al. (2015), retaining
10 classes of images, representing each class by 100 samples in a training set and 50 per class in a
testing set. CIFAR-10 contains 50,000 images in its training set and 10,000 images in its testing
set. MNIST contains 60,000 images in its training set and 10,000 images in its testing set. The
images in the MNIST set are grayscale. The images in both the CIFAR-10 and ImageNet sets are full
color, with three color channels. We neither augmented the input data nor regularized the cost/loss
functions. We used the Torch7 platform (http://torch.ch) for all computations.
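
A single stage of this kind can be sketched in NumPy as follows. This is not the Torch code used
for the experiments: the loop-based sliding-window product, the random (rather than learned)
kernels, and all function names are assumptions made for illustration. Because of the absolute
value, multiplying the input by a real factor multiplies the output of this sketch by the magnitude
of that factor.

```python
import numpy as np


def valid_correlate(image, kernel):
    """'Valid' sliding-window product; each dimension shrinks by (kernel size - 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=complex)
    for di in range(kh):
        for dj in range(kw):
            out += kernel[di, dj] * image[di:di + oh, dj:dj + ow]
    return out


def stage(images, kernels):
    """images: (n_in, H, W); kernels: (n_out, n_in, kh, kw), complex valued.

    Convolve, sum over the input images, take absolute values, then average
    over non-overlapping 2 x 2 patches.
    """
    n_out, n_in = kernels.shape[:2]
    outputs = []
    for o in range(n_out):
        summed = sum(valid_correlate(images[i], kernels[o, i]) for i in range(n_in))
        magnitude = np.abs(summed)
        h, w = magnitude.shape[0] // 2 * 2, magnitude.shape[1] // 2 * 2
        pooled = magnitude[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        outputs.append(pooled)
    return np.stack(outputs)


rng = np.random.default_rng(0)
images = rng.standard_normal((3, 32, 32))  # three color channels, 32 x 32 pixels
kernels = rng.standard_normal((16, 3, 3, 3)) + 1j * rng.standard_normal((16, 3, 3, 3))
out = stage(images, kernels)
```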


Tables 1–4 display the specific configurations we used. “Stage” specifies the positions of the
indicated layers in the convnet. “Input images” specifies the number of images input to the given
stage for each sample from the data. “Output images” specifies the number of images output from
the given stage. Each input image is convolved against a separate, learned convolutional kernel for
each output image (with the results of all these convolutions summed together for each output
image). “Kernel size” specifies the size of the square grid of pixels used in the convolutions.
“Input image size” specifies the size of the square grid of pixels constituting each input image.
“Output image size” specifies the size of the square grid of pixels constituting each output image.
Tables 1 and 2 display the two configurations used for processing both CIFAR-10 and MNIST.
Tables 3 and 4 display the two configurations used for processing the subset of ImageNet described
above.
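
The image sizes in the tables follow from the rule stated earlier: a convolution removes one less
than the kernel size from each dimension, and the 2 × 2 averaging then halves it. The short sketch
below (function names are hypothetical) reproduces that arithmetic for the kernel sizes listed in
the tables.

```python
def stage_output_size(input_size, kernel_size, pool=2):
    """Side length after a 'valid' convolution followed by pool x pool averaging."""
    convolved = input_size - (kernel_size - 1)
    return convolved // pool


def chain_sizes(input_size, kernel_sizes):
    """Output side length of each successive stage."""
    sizes = []
    for kernel_size in kernel_sizes:
        input_size = stage_output_size(input_size, kernel_size)
        sizes.append(input_size)
    return sizes


cifar_mnist_sizes = chain_sizes(32, [3, 2, 2])   # kernel sizes from Tables 1 and 2
imagenet_sizes = chain_sizes(128, [5, 3, 3, 3])  # kernel sizes from Tables 3 and 4
```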


Figures 1–6 plot the accuracies attained by the different schemes for classification while varying h
from (4) (h is the “learning rate,” as well as the length of the learning step relative to the
magnitude of the gradient) and varying the dimension m of the space containing the simplex targets;
m is the number of rows in A from (3) and (4). In each figure, the top panel, labeled “(a)” and
“rescaled,” plots the error rates for classification using rescaling, with the targets being the
vertices on the hypersphere of a regular simplex; the middle panel, labeled “(b)” and “logistic,”
plots the error rates for classification using multinomial logistic regression; the bottom panel,
labeled “(c)” and “best of both,” plots the error rates for the best-performing instance from the
top panel (a) together with the best-performing instance from the middle panel (b). All error rates
refer to performance on the test set. The label “epoch” for the horizontal axes refers, as usual, to
the number of training sweeps through the data set, as reviewed in the coming paragraph.


As recommended by LeCun et al. (1998), we learn via (minibatched) stochastic gradient descent,
with 100 samples per minibatch; rather than updating the parameters being learned for randomly
selected individual images from the training set exactly as in Section 3, we instead randomly permute
the training set and partition this permuted set of images into subsets of 100, updating the
parameters simultaneously for all 100 images constituting each of the subsets (known as
“minibatches”), processing the minibatches one after another. Each sweep through the entire training
set is known as an “epoch.” The horizontal axes in the figures count the number of epochs.
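
The permute-and-partition scheme can be sketched as follows. Only the index bookkeeping is shown;
the function name is hypothetical, and the parameter update and rescaling inside the loop are left
as a comment because they depend on the network.

```python
import numpy as np


def minibatch_indices(num_samples, batch_size, rng):
    """Randomly permute the sample indices and split them into consecutive minibatches."""
    permutation = rng.permutation(num_samples)
    return [permutation[start:start + batch_size]
            for start in range(0, num_samples, batch_size)]


rng = np.random.default_rng(0)
num_epochs = 2
for epoch in range(num_epochs):
    # 50,000 samples and 100 per minibatch, as for the CIFAR-10 training set.
    batches = minibatch_indices(50_000, 100, rng)
    for batch in batches:
        pass  # update all parameters using the samples indexed by `batch`, then rescale A
```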


In the experiments of the present section, the accuracies attained using the scale-invariant
classification stage are comparable to (if not better than) those attained using the usual
multinomial logistic regression. Running the experiments with several different random seeds produces
entirely similar results. The scale-invariant classification stage is stable for all values of h,
that is, for all learning rates.


6 Conclusion 


Combining [1] a convolutional network that is equivariant to scalar multiplication, [2] a
classification stage that is invariant to scalar multiplication, and [3] the rescaled learning of the
last paragraph of Section 3 fully realizes and leverages invariance to scalar multiplication. This
combination is more robust to outliers (including obviously mislabeled data) than the standard
multinomial logistic regression “softmax” classification scheme, results in marginally lower errors
on several standard test sets, and fits/learns/trains precisely at the user-specified rate, all while
costing about the same computationally. The attained invariance is clean and convenient, a good goal
all on its own.

Table 1: CIFAR-10 and MNIST, larger (t=1 for grayscale MNIST; t=3 for full-color CIFAR-10)

Stage    Input images   Output images   Kernel size   Input image size   Output image size
first    t              16              3 × 3         32 × 32            15 × 15
second   16             128             2 × 2         15 × 15            7 × 7
third    128            1024            2 × 2         7 × 7              3 × 3


Table 2: CIFAR-10 and MNIST, smaller (t=1 for grayscale MNIST; t=3 for full-color CIFAR-10)

Stage    Input images   Output images   Kernel size   Input image size   Output image size
first    t              16              3 × 3         32 × 32            15 × 15
second   16             64              2 × 2         15 × 15            7 × 7
third    64             256             2 × 2         7 × 7              3 × 3


Table 3: ImageNet subset, larger (full color, with 3 color channels)

Stage    Input images   Output images   Kernel size   Input image size   Output image size
first    3              16              5 × 5         128 × 128          62 × 62
second   16             64              3 × 3         62 × 62            30 × 30
third    64             256             3 × 3         30 × 30            14 × 14
fourth   256            1024            3 × 3         14 × 14            6 × 6


Table 4: ImageNet subset, smaller (the smaller number is italicized in this table)

Stage    Input images   Output images   Kernel size   Input image size   Output image size
first    3              16              5 × 5         128 × 128          62 × 62
second   16             64              3 × 3         62 × 62            30 × 30
third    64             256             3 × 3         30 × 30            14 × 14
fourth   256            *256*           3 × 3         14 × 14            6 × 6

Figure 1: CIFAR-10, larger architecture

Figure 2: MNIST, larger architecture

Figure 3: ImageNet subset, larger architecture

Figure 4: CIFAR-10, smaller architecture

Figure 5: MNIST, smaller architecture

Figure 6: ImageNet subset, smaller architecture